Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark

Beijing Normal University, Australian National University, Beijing 101 Education Group
✉️Correspondence to fangweizhong@bnu.edu.cn

Abstract

Concepts serve as fundamental abstractions that support human reasoning and categorization. However, it remains unclear whether large language models (LLMs) truly capture such conceptual structures or primarily rely on surface-level pattern memorization. Existing benchmarks are largely static and fact-oriented, which limits their ability to probe fine-grained semantic understanding and makes them vulnerable to data leakage and overfitting. To address this limitation, we introduce CK-Arena, a dynamic benchmark for conceptual knowledge evaluation based on a multi-agent social deduction game, namely the Undercover game. In this setting, LLM-based agents are assigned subtly different concept words and must describe, distinguish, and infer conceptual properties from others’ statements. Model performance is evaluated through both game-level outcomes and the semantic quality of generated descriptions. Furthermore, CK-Arena leverages the interaction process to automatically construct high-quality question-answering data for fine-grained diagnostic analysis. Experimental results show that conceptual understanding varies substantially across models and categories, and is not strictly aligned with overall model capability.

CK-Arena Demo: Undercover Game

Below is an interactive demonstration of the Undercover game used in CK-Arena. We first introduce the basic game rules, then show the agents' interaction in the first round of a real experiment. In this game, LLM agents are assigned either the main concept ("bee") or an undercover concept ("butterfly"). Players take turns making statements about their concept without revealing it directly. The civilians' goal is to identify and eliminate the undercover agents through voting, while the undercover agents try to blend in without being detected.

Game Flow (a minimal code sketch of this loop follows the list):
1. Role Assignment: Players are randomly assigned as civilians or undercover agents, each receiving a similar but distinct concept.
2. Concept Description: In each round, players take turns describing their concept while trying to hide their identity and infer others’.
3. LLM Evaluation: Statements are scored by LLM judges based on novelty, relevance, and reasonableness.
4. Threshold-Based Elimination: If a player’s score falls below a predefined threshold, they are automatically eliminated.
5. Voting Round: After a fixed number of rounds, all surviving players vote to eliminate one player based on the dialogue so far.
6. Win Condition Check: The game ends when all undercover agents are eliminated (civilians win), when the number of undercover agents equals the number of civilians (undercover wins), or when the maximum number of rounds is reached.
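The rules above map to a simple interaction loop. The sketch below is only an illustration of that loop, not the released CK-Arena implementation: the agent methods (describe, vote), the judge's score method, the threshold value, and the voting schedule are all hypothetical placeholders.

SCORE_THRESHOLD = 0.5   # assumed elimination threshold; the paper's value may differ
MAX_ROUNDS = 5          # assumed round limit

def play_undercover(players, judge):
    """One Undercover game between LLM agents.

    `players` is a list of dicts with keys "agent", "concept", and "role"
    ("civilian" or "undercover"); `judge` scores statements. All helper
    names here are illustrative placeholders.
    """
    alive = list(players)
    history = []  # shared dialogue visible to every agent

    for _ in range(MAX_ROUNDS):
        # Concept description: each survivor describes its concept in turn.
        for p in list(alive):
            statement = p["agent"].describe(p["concept"], history)
            score = judge.score(statement, history)  # novelty, relevance, reasonableness
            history.append((p["agent"].name, statement))
            # Threshold-based elimination on low-quality statements.
            if score < SCORE_THRESHOLD:
                alive.remove(p)

        # Voting round: survivors vote to eliminate one player
        # (here a vote follows every description round, for simplicity).
        candidates = [q["agent"].name for q in alive]
        votes = [p["agent"].vote(history, candidates) for p in alive]
        if votes:
            voted_out = max(set(votes), key=votes.count)
            alive = [p for p in alive if p["agent"].name != voted_out]

        # Win-condition check.
        undercover = [p for p in alive if p["role"] == "undercover"]
        civilians = [p for p in alive if p["role"] == "civilian"]
        if not undercover:
            return "civilians win"
        if len(undercover) >= len(civilians):
            return "undercover wins"

    return "round limit reached"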

Experimental Results

We present selected experimental results here. For the complete experimental setup and analysis, please refer to the paper.

Model Leaderboard

Leaderboard of LLMs in CK-Arena. Each player starts with an initial rating of 0. After the ratings stabilize, a player that consistently defeats 0-rated opponents converges to a rating of around 420, which serves as a reference point for strong performance. The leaderboard highlights relative differences across the 18 evaluated LLMs.
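The rating dynamic described above is Elo-style: per-game gains shrink as a player's expected win probability against its opponents approaches 1. The exact update rule and constants used in CK-Arena are not given here, so the snippet below is only a generic Elo-style sketch with conventional defaults, not the benchmark's actual rating scheme.

def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Logistic expectation that player A beats player B (standard Elo form)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def update_ratings(r_a: float, r_b: float, outcome_a: float, k: float = 32.0):
    """Update both ratings after one game.

    `outcome_a` is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw. The scale
    and K-factor are conventional Elo defaults, not necessarily the
    constants CK-Arena uses.
    """
    delta = k * (outcome_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# A player starting at 0 gains 16 points for beating another 0-rated player,
# but the per-game gain shrinks as its rating (and expected score) rises.
print(update_ratings(0.0, 0.0, 1.0))    # (16.0, -16.0)
print(update_ratings(400.0, 0.0, 1.0))  # (~402.9, ~-2.9)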

Relevance Heatmap

Relevance scores of different LLMs across various categories. In this heatmap, darker cells indicate higher scores, giving an intuitive picture of how closely each LLM's descriptions relate to the target concepts in each category.
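A heatmap like this can be reproduced by averaging judge-assigned relevance scores per (model, category) cell and plotting the resulting matrix. The sketch below assumes a simple record format of (model, category, score) tuples; the field names and data layout are illustrative, not the released CK-Arena data schema.

import numpy as np
import matplotlib.pyplot as plt

def relevance_matrix(records, models, categories):
    """Average relevance score per (model, category) cell.

    `records` is an iterable of (model, category, score) tuples collected
    from game logs (an assumed format).
    """
    sums = np.zeros((len(models), len(categories)))
    counts = np.zeros_like(sums)
    for model, category, score in records:
        i, j = models.index(model), categories.index(category)
        sums[i, j] += score
        counts[i, j] += 1
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

def plot_heatmap(matrix, models, categories):
    fig, ax = plt.subplots(figsize=(8, 4))
    im = ax.imshow(matrix, cmap="Blues")  # darker cell = higher mean relevance
    ax.set_xticks(range(len(categories)))
    ax.set_xticklabels(categories, rotation=45, ha="right")
    ax.set_yticks(range(len(models)))
    ax.set_yticklabels(models)
    fig.colorbar(im, ax=ax, label="mean relevance")
    fig.tight_layout()
    return fig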

Static Leaderboard

QA benchmark results. Three open-source models were selected for this evaluation, and the results reflect each model's performance on individual tasks. The consistency between the QA benchmark ranking and the dynamic game ranking also indirectly supports the reliability of the dynamic evaluation.
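The consistency between the two rankings can be quantified with a rank correlation. The snippet below is a minimal sketch using Spearman's rho; the model names and rank values are placeholders, not the reported results.

from scipy.stats import spearmanr

# Hypothetical rankings: position of each model in the QA benchmark and in
# the dynamic CK-Arena leaderboard (1 = best). Values are illustrative only.
qa_rank      = {"model_a": 1, "model_b": 2, "model_c": 3}
dynamic_rank = {"model_a": 1, "model_b": 3, "model_c": 2}

models = sorted(qa_rank)
rho, p_value = spearmanr([qa_rank[m] for m in models],
                         [dynamic_rank[m] for m in models])
print(f"Spearman rank correlation: {rho:.2f} (p={p_value:.2f})")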

Win Rate Performance

The win-rate performance of six LLMs across 11 categories. A comparative analysis reveals that each model exhibits distinct strengths and weaknesses across concept categories. These variations are likely influenced by differences in training data, architectural design, and optimization strategies specific to each model. The analysis highlights each model's focus areas and knowledge gaps, and offers insights for improving conceptual reasoning.
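Per-category win rates follow directly from the game outcomes by grouping games on model and category. The sketch below assumes one row per (model, game) with a binary win flag; the column names and rows are illustrative placeholders, not the released data.

import pandas as pd

games = pd.DataFrame([
    {"model": "model_a", "category": "Animals", "win": 1},
    {"model": "model_a", "category": "Food",    "win": 0},
    {"model": "model_b", "category": "Animals", "win": 0},
    {"model": "model_b", "category": "Food",    "win": 1},
])

win_rates = (games.groupby(["model", "category"])["win"]
                  .mean()
                  .unstack("category"))  # models x categories table of win rates
print(win_rates)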

t-SNE Visualization

t-SNE visualizations of LLM statements across concept categories. Each plot shows model outputs for (a) Animals, (b) Food, and (c) Electronics. Repetitive descriptions, reflecting shallow understanding, appear as tightly clustered points, whereas richer knowledge produces more dispersed distributions. The visualizations also indicate that different LLMs center their descriptions on different focal aspects of a concept, suggesting variation in how conceptual knowledge is represented.
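Such a plot can be produced by embedding each statement and projecting the embeddings to 2-D with t-SNE. The paper does not specify the embedding backend, so the sentence-transformers encoder below is an assumption, and the function and parameter names are illustrative.

from sentence_transformers import SentenceTransformer  # assumed embedding backend
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_statement_tsne(statements_by_model, perplexity=30, seed=0):
    """2-D t-SNE map of LLM statements for one concept category.

    `statements_by_model` maps a model name to a list of its statements.
    Tightly clustered points suggest repetitive descriptions; dispersed
    points suggest more varied conceptual coverage.
    """
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
    labels, texts = [], []
    for model, statements in statements_by_model.items():
        labels += [model] * len(statements)
        texts += statements

    embeddings = encoder.encode(texts)
    coords = TSNE(n_components=2, perplexity=perplexity,
                  random_state=seed).fit_transform(embeddings)

    fig, ax = plt.subplots()
    for model in statements_by_model:
        idx = [i for i, label in enumerate(labels) if label == model]
        ax.scatter(coords[idx, 0], coords[idx, 1], s=10, label=model)
    ax.legend()
    ax.set_title("t-SNE of statements (one category)")
    return fig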

Interactive Knowledge Graphs

Explore concept relationships extracted from LLM-generated descriptions. Select a category to view its interactive knowledge graph.
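A knowledge graph of this kind can be assembled from (concept, relation, attribute) triples extracted from the generated descriptions. The sketch below uses networkx; the extraction step is assumed to happen upstream, and the example triples are placeholders rather than data from the benchmark.

import networkx as nx

# Illustrative triples that an extraction step might produce from statements
# about the "bee" / "butterfly" concept pair.
relations = [
    ("bee", "produces", "honey"),
    ("bee", "is_a", "insect"),
    ("butterfly", "is_a", "insect"),
    ("butterfly", "has", "colorful wings"),
]

graph = nx.DiGraph()
for head, rel, tail in relations:
    graph.add_edge(head, tail, relation=rel)

# Concepts sharing neighbors (e.g. "insect") sit close together in the graph,
# which is what makes "bee" vs. "butterfly" a subtly different concept pair.
print(sorted(nx.common_neighbors(graph.to_undirected(), "bee", "butterfly")))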

BibTeX

If you need to cite our work:

@article{xu2025probe,
  title={Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark},
  author={Shuhang Xu and Weijian Deng and Yixuan Zhou and Fangwei Zhong},
  journal={arXiv preprint arXiv:2505.17512},
  year={2025}
}